Lecture 3: Data wrangling basics in R

EC7412 Part II: Data Science for Economists

Adam Altmejd Selder

Swedish Institute for Social Research (SOFI)

April 16, 2025

Introduction

  • Introduction

  • Basic R syntax

  • When things don’t work

  • Logic

  • Subsetting

  • Scoping

Introduction

Why manipulate data with code?


Introduction

Today

  • R programming from fundamentals
  • Much of what you will learn applies to any programming language
  • Follow along on your computers!

Introduction

R fundamentals

  • “Everything is an object and everything has a name”
    • You can access and modify everything you interact with
  • You call functions to operate on objects

Introduction

Useful to know when coming from Stata

  • Not limited to one dataset in memory
  • Stata commands = R functions
  • Less “hand-holding” in statistical analyses: you need to be more mindful about what you do
  • Much more dependent on external packages: you will need to learn to install and load them

Introduction

Coding environment recap

  • Make sure you have R and VS Code installed
  • Also useful to have the VS Code R extension
  • You can follow along in the VS Code terminal

Basic R syntax

Operations

We can use R for arithmetic by executing operations in an R terminal:

12 + 3
[1] 15
8 - 33
[1] -25
20 / 3
[1] 6.666667
10 ^(2 + 1) - 1
[1] 999

Each line calls arithmetic functions on scalar objects (numbers).

Basic R syntax

Running code

Options:

  • Manually enter code into terminal
  • Ctrl/cmd+enter to send the current line to the terminal in VS Code
  • Run whole R-script file (.R) with Ctrl/cmd+shift+S

Basic R syntax

Assignment

We create new objects by assigning values to names:

a <- 3
b = 5

This creates two scalars and assigns them the names "a" and "b".

  • You can use both <- and = for assignment in R.
  • Almost everyone uses <-, but it has a pitfall: a single space turns an assignment into a comparison. For example:
a < -5
a <- 5
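
A minimal sketch of the pitfall above, showing what R does with each spelling. The name is_smaller is only used for illustration.

```r
a <- 3
is_smaller <- a < -5    # with the space: compares a to -5, a stays 3
is_smaller              # FALSE
a<-5                    # without spaces: R reads this as an assignment
a                       # 5
is_smaller <- a < (-5)  # parentheses make the intended comparison explicit
```

Putting spaces around <- consistently makes the two cases easy to tell apart.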

Basic R syntax

Objects

  • typeof() tells us how objects are stored in memory
  • Data will usually be stored in one of 4 types:
typeof(1.0)
[1] "double"
typeof(1L)
[1] "integer"
typeof(TRUE)
[1] "logical"
typeof("text")
[1] "character"
  • But sometimes the object’s class() is more informative:
typeof(mean)
[1] "closure"
class(mean)
[1] "function"

Basic R syntax

Naming

  • A name must start with a letter (or a period not followed by a digit) and can contain letters, numbers, underscores, and periods. Reserved words such as function cannot be used as names.
1a = 1
Error in parse(text = input): <text>:1:2: unexpected symbol
1: 1a
     ^
function = 1
Error in parse(text = input): <text>:1:10: unexpected '='
1: function =
             ^
  • Names are case sensitive!
  • No length limit: household_income_after_tax is much better than hhinc2.

Basic R syntax

Vectors and lists

c(1,2,3)
[1] 1 2 3

Elements of a vector need to be of the same type, or they will be coerced:

c(1,2,"3")
[1] "1" "2" "3"

Lists are collections of arbitrary types:

list(2,"3")
[[1]]
[1] 2

[[2]]
[1] "3"
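
The coercion rule for vectors can be explored with a small sketch. When types are mixed, R picks the most general one, in the order logical, integer, double, character. The variable names are made up for the example.

```r
mixed_int <- c(TRUE, 1L)   # logical + integer
mixed_dbl <- c(1L, 2.5)    # integer + double
mixed_chr <- c(1, "a")     # double + character
typeof(mixed_int)  # "integer"
typeof(mixed_dbl)  # "double"
typeof(mixed_chr)  # "character"

# Converting back explicitly: entries that are not numbers become NA.
# R warns about this; suppressWarnings() hides the warning here.
converted <- suppressWarnings(as.numeric(c("1", "2", "three")))
converted          # 1 2 NA
```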

Basic R syntax

Missing values

  • Each type has a “missing value” indicator NA (not available)
c(1,NA,-1) > 0
[1]  TRUE    NA FALSE
  • For numbers, there is also Inf, -Inf and NaN (not a number)
1/0
[1] Inf
0/0
[1] NaN
is.na(c(NA, NaN))
[1] TRUE TRUE
is.nan(c(NA, NaN))
[1] FALSE  TRUE
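
Missing values propagate through most summary functions. A short sketch with made-up income data shows how this looks in practice and how the na.rm argument of mean() drops missing values first.

```r
incomes <- c(250, NA, 310)                    # hypothetical data, one value missing
mean_all <- mean(incomes)                     # NA: the mean of an unknown value is unknown
mean_observed <- mean(incomes, na.rm = TRUE)  # 280: NA removed before averaging
n_missing <- sum(is.na(incomes))              # 1
```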

Basic R syntax

Data frames

Data frames are collections (lists really) of vectors of the same length, organized in a table:

df = data.frame(x = c(1,2,3),
                y = c(4,5,6))
df
  x y
1 1 4
2 2 5
3 3 6

When processing data in R, you will run most of your operations on data frames (or similar objects such as tibbles or data.tables).

df^2
  x  y
1 1 16
2 4 25
3 9 36
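
Because a data frame is a list of equally long vectors, the usual list and vector tools apply to it. A small sketch, reusing the df from above; the column name z is an illustration only.

```r
df <- data.frame(x = c(1, 2, 3),
                 y = c(4, 5, 6))
typeof(df)            # "list": one list element per column
length(df)            # 2 columns
nrow(df)              # 3 rows
df$z <- df$x + df$y   # new column, computed element by element
df$z                  # 5 7 9
```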

Basic R syntax

Functions

  • Like programs in Stata, functions are reusable code “blocks”
  • Can take arguments that become variables inside the function
  • You should strive to write as much code as possible in functions
myfunc = function(v1, v2) max(v1, v2)
myfunc(1,2)
[1] 2
myfunc(1,2,3)
Error in myfunc(1, 2, 3): unused argument (3)
myfunc(c(1,2))
Error in myfunc(c(1, 2)): argument "v2" is missing, with no default
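
Both errors above come from calling myfunc() with the wrong number of arguments. Two possible ways to make a function more flexible are default values and the special ... argument. The function names below are hypothetical variants of myfunc, not part of the lecture code.

```r
# v2 has a default value, so one argument is enough
myfunc_default <- function(v1, v2 = 0) max(v1, v2)
myfunc_default(-1)     # 0
myfunc_default(1, 2)   # 2

# ... accepts any number of arguments and passes them on to max()
myfunc_any <- function(...) max(...)
myfunc_any(1, 2, 3)    # 3
myfunc_any(c(1, 2))    # 2
```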

Basic R syntax

Pipes

Pipes (|>) make your code more readable:

sqrt(max(append(c(1,2,3), 4)))
[1] 2
c(1,2,3) |>
  append(4) |>
  max() |>
  sqrt()
[1] 2
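
The pipe is only a different way of writing a nested call: x |> f(y) is read by R as f(x, y). A quick check, assuming R 4.1 or later (where |> was introduced):

```r
nested <- sqrt(max(append(c(1, 2, 3), 4)))
piped <- c(1, 2, 3) |> append(4) |> max() |> sqrt()
identical(nested, piped)   # TRUE
```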

Basic R syntax

Comments

Text after a # on a line is not executed. Use comments to explain what you do, and to get code-completion suggestions from Copilot.

# myfunc(v1, v2) returns the larger of v1 and v2
myfunc = function(v1, v2) max(v1, v2)

But: “Good code explains itself”

return_largest_number = function(v1, v2) max(v1, v2)

When things don’t work

Get help

  • Run ?named_object
  • For packages you can also often run ?package.name.
?data.frame


When things don’t work

Understanding error messages

dat = data.frame(a = c(1,2),
                 b = c(3,4))
data[1]
Error in data[1]: object of type 'closure' is not subsettable

!?!?

typeof(data)
[1] "closure"
typeof(dat)
[1] "list"

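The error appears because we typed data, a function that ships with R, instead of dat. Learning to read error messages will help you find coding errors like this one.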

When things don’t work

Understand the objects you are working with

Print the object in the terminal:

dat
  a b
1 1 3
2 2 4
  • View() is useful for larger objects, try View(mtcars)

  • str() gives you the structure of the object

  • pillar::glimpse() is a nice alternative to str()

  • For functions, printing can give you the code. Try printing data.

When things don’t work

Debugging basics

  • Debugging is really useful for finding and fixing errors
  • Use browser() to enter the debugger at a certain place
return_largest_number = function(v1, v2) {
  browser()
  max(v1, -v2)
}
  • VS Code has an R debugger extension that makes this easier

Logic

We often want to run operations conditional on certain criteria. For this we need logical operators.

1 == 2
[1] FALSE
1 < 2
[1] TRUE
1 == 2 & 1 < 2
[1] FALSE
1 == 2 | 1 < 2
[1] TRUE

Note the difference between assignment (=) and comparison (==).

Logic

Precedence

Let’s say we want to check if a variable is larger than two other variables:

x = 3; y = 1; z = 4
x > y & z
[1] TRUE

Comparison operators (>, ==, etc.) are evaluated before Boolean operators (& and |), so R first computes x > y and then combines the result with z:

3 > 1
[1] TRUE
TRUE & 4
[1] TRUE

Boolean operators require logical arguments, so R runs as.logical(4) before applying &. All non-zero numbers are coerced to TRUE; only 0 is FALSE.

To get it right we need to be explicit:

x > y & x > z
[1] FALSE
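
A short sketch comparing the misleading expression with several explicit alternatives. The result names are made up; max() and all() are just two possible ways to write the same check.

```r
x <- 3; y <- 1; z <- 4
misleading <- x > y & z         # TRUE: read as (x > y) & as.logical(z)
explicit <- x > y & x > z       # FALSE: x is not larger than z
via_max <- x > max(y, z)        # FALSE: larger than the largest of the two
via_all <- all(x > c(y, z))     # FALSE: larger than every element
```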

Logic

Logic on vectors

c(1,2,3) > 2
[1] FALSE FALSE  TRUE
c(1,2,3) > c(0,1,2)
[1] TRUE TRUE TRUE

But! When the vectors are of different length, R “recycles” the shorter vector.

c(1,2,3) > c(1,3)
Warning in c(1, 2, 3) > c(1, 3): longer object length is not a multiple of
shorter object length
[1] FALSE FALSE  TRUE

When the length of the longer vector is a multiple of the length of the shorter one, there is not even a warning.

c(1,2,3) > c(0,1,2,3,4,5)
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE

Be careful!
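
One way to guard against silent recycling is to check lengths before comparing. A minimal sketch, where safe_greater() is a hypothetical helper written for this example:

```r
# Compare two vectors only if their lengths match or one of them is a scalar;
# otherwise stopifnot() raises an error instead of recycling silently
safe_greater <- function(a, b) {
  stopifnot(length(a) == length(b) || length(a) == 1 || length(b) == 1)
  a > b
}
same_length <- safe_greater(c(1, 2, 3), c(0, 1, 2))   # TRUE TRUE TRUE
with_scalar <- safe_greater(c(1, 2, 3), 2)            # FALSE FALSE TRUE
# safe_greater(c(1, 2, 3), c(0, 1, 2, 3, 4, 5))       # would stop with an error
```

Recycling a scalar is usually what we want, so the helper still allows it.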

Logic

Boolean operators on vectors

The same is true for Boolean operators:

c(TRUE, FALSE) | c(FALSE, TRUE)
[1] TRUE TRUE
c(TRUE, FALSE) & c(TRUE, FALSE, FALSE)
Warning in c(TRUE, FALSE) & c(TRUE, FALSE, FALSE): longer object length is not
a multiple of shorter object length
[1]  TRUE FALSE FALSE

To require scalars for our comparison, we can use && and || instead:

c(TRUE, FALSE) && c(FALSE, TRUE)
Error in c(TRUE, FALSE) && c(FALSE, TRUE): 'length = 2' in coercion to 'logical(1)'
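
Besides requiring scalars, && and || also stop evaluating as soon as the result is known. A small sketch of why this is useful in conditions; the variable names are made up.

```r
x <- NULL
# && evaluates the right-hand side only if the left-hand side is TRUE,
# so the comparison on the empty x is never run
ok <- !is.null(x) && x > 0
ok                                       # FALSE
# & evaluates both sides and returns a zero-length result here
n_results <- length(!is.null(x) & x > 0)
n_results                                # 0
```

A zero-length condition would make if() fail, which is why && is the safer choice inside if statements.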

Logic

Missing values

In R, NA is properly treated as “missing”: the value could be anything.

NA > 0
[1] NA
NA & TRUE
[1] NA

But:

NA | TRUE
[1] TRUE
NA & FALSE
[1] FALSE

Here, it does not matter what NA could be: both TRUE | TRUE and FALSE | TRUE evaluate to TRUE, and both TRUE & FALSE and FALSE & FALSE evaluate to FALSE.

Logic

Missing values (cont.)

What if we want to check which values of a vector are missing?

x = c(1, NA, 3)
x == NA
[1] NA NA NA

Actually, even NA is not equal to NA.

NA == NA
[1] NA

Instead, we need to use is.na():

is.na(x)
[1] FALSE  TRUE FALSE

Logic

Negation

! negates a logical statement:

!TRUE
[1] FALSE
1 != 2
[1] TRUE
!(1 < 3)
[1] FALSE

For example, we might want to filter for non-missing values by running !is.na():

!is.na(x)
[1]  TRUE FALSE  TRUE
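
The logical vector from !is.na() can be used directly to keep only the observed values. A minimal sketch; the [] operator it relies on is covered in the Subsetting section.

```r
x <- c(1, NA, 3)
observed <- x[!is.na(x)]   # 1 3: keep only the non-missing values
n_missing <- sum(is.na(x)) # 1: TRUE counts as 1 when summed
has_missing <- anyNA(x)    # TRUE
```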

Logic

%in%

To check if a scalar is an element of a vector:

1 %in% c(1,2,3)
[1] TRUE

It works just as well on character vectors, and with a vector on the left-hand side (one result per element):

c("a", "c") %in% c("a", "b", "c")
[1] TRUE TRUE
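
A typical use of %in% is to filter one vector by membership in another. A small sketch with made-up country codes:

```r
countries <- c("SE", "NO", "DE", "FR")         # hypothetical data
nordic <- c("SE", "NO", "DK", "FI", "IS")
is_nordic <- countries %in% nordic             # TRUE TRUE FALSE FALSE
nordic_only <- countries[is_nordic]            # "SE" "NO"
others <- countries[!(countries %in% nordic)]  # "DE" "FR"
na_found <- NA %in% c(1, NA)                   # TRUE: unlike ==, %in% never returns NA
```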

Logic

Caution: floating-point numbers

Arithmetic operations on floating-point numbers are not exact:

0.3 == 0.3
[1] TRUE
0.1 + 0.2 == 0.3
[1] FALSE

Instead do:

all.equal(0.3, 0.1 + 0.2)
[1] TRUE
abs(0.3 - (0.1 + 0.2)) < 1e-15
[1] TRUE
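
When all.equal() finds a difference it returns a character description rather than FALSE, so in conditions it is usually wrapped in isTRUE(). A minimal sketch; the variable names are made up.

```r
total <- 0.1 + 0.2
# isTRUE() turns the result of all.equal() into a plain TRUE or FALSE
verdict <- if (isTRUE(all.equal(total, 0.3))) "equal" else "different"
verdict                                      # "equal"
clearly_different <- isTRUE(all.equal(1, 1.1))
clearly_different                            # FALSE
```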

Logic

Control flow: if/else

if (1 == 3) "foo" else "bar"
[1] "bar"

Can be used directly in assignment (not recommended)

x = 0 + if (1 == 3) 2 else 0
x
[1] 0

If statements should normally be written on multiple lines with curly braces:

if (TRUE) {
  "foo"
} else {
  "bar"
}
[1] "foo"

Logic

Control flow: vectors

if can only evaluate logical scalars. For vectors, use ifelse():

if (c(3,1) == c(1,4)) "foo" else "bar"
Error in if (c(3, 1) == c(1, 4)) "foo" else "bar": the condition has length > 1
ifelse(c(3,1) == 1, "foo", "bar")
[1] "bar" "foo"

We can nest multiple ifelse():

x <- c(1,2,3)
ifelse(x==1, "a", ifelse(x==2, "b", NA))
[1] "a" "b" NA 

Or use data.table::fcase() (or dplyr::case_when())

data.table::fcase(
  x == 1, "a",
  x == 2, "b"
)
[1] "a" "b" NA 

Subsetting

Subsetting with []

a = c(4,5,8,1)

a[n] selects the nth element of the vector a:

a[2]
[1] 5

We can supply a vector of integers (negative integers drop elements):

a[c(1,3,1)]
[1] 4 8 4
a[-3]
[1] 4 5 1

Or logicals:

a[a > 4]
[1] 5 8
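
Two related tools, sketched on the same vector: which() turns a logical vector into positions, and [] can also appear on the left-hand side of an assignment to replace the selected elements.

```r
a <- c(4, 5, 8, 1)
positions <- which(a > 4)    # 2 3: positions where the condition is TRUE
largest <- a[which.max(a)]   # 8
a[a > 4] <- 0                # replace the selected elements
a                            # 4 0 0 1
```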

Subsetting

Selecting with [[]] or $

a_list = list(named_elem = c(1,2,3),
               c("a", "b", "c"))
a_list
$named_elem
[1] 1 2 3

[[2]]
[1] "a" "b" "c"

Let’s pick the first element of the list:

a_list[1]
$named_elem
[1] 1 2 3
a_list[[1]]
[1] 1 2 3

Do you see the difference? [] returns a list of length 1 while [[]] returns the vector stored as the first element of a_list (try typeof() to see).

Subsetting

Selecting with [[]] or $ (cont.)

a$x is a shorthand for a[["x"]]

a_list$named_elem
[1] 1 2 3

When [[]] returns a vector we can simply follow with [] to select specific vector element(s):

a_list[[2]][1]
[1] "a"
a_list$named_elem[3]
[1] 3
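
The difference between [] and [[]] can be checked with typeof(). Since data frames are lists, the same operators also select columns. A short sketch reusing a_list and a data frame like the df from earlier:

```r
a_list <- list(named_elem = c(1, 2, 3), c("a", "b", "c"))
typeof(a_list[1])      # "list": still a list, now of length 1
typeof(a_list[[1]])    # "double": the vector itself
typeof(a_list[[2]])    # "character"

df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
y_column <- df[["y"]]  # 4 5 6
second_y <- df$y[2]    # 5
```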

Scoping

Function scope

Variables defined in functions are not “global”

x = 10
set_xy = function() {
  x = 20
  y = 30
  c(x,y)
}
set_xy()
[1] 20 30
x
[1] 10
y
Error: object 'y' not found

Scoping

Lexical scoping

But what if the object does not exist in the function scope?

x = 10
y = 20
set_xy = function() {
  x = 20
  c(x, y)
}
set_xy()
[1] 20 20

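R looks “one level up”: when a name is not found inside the function, R searches the environment where the function was defined (here, the global environment).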
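
“Lexical” means that the lookup follows where a function was defined, not where it is called from. A minimal sketch with two made-up functions:

```r
y <- 20
show_y <- function() y   # y is not defined inside show_y
caller <- function() {
  y <- 99                # local to caller, invisible to show_y
  show_y()
}
result <- caller()
result                   # 20: show_y was defined at the top level, so it finds y = 20
```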

Scoping

Accessing data.frame columns

Let’s say we want to run a regression on the data in the df data.frame we created earlier:

df
  x y
1 1 4
2 2 5
3 3 6
lm(y ~ x)
Error in eval(predvars, data, env): object 'y' not found

R cannot find the y variable. We need to look “inside” the df object.

Scoping

Accessing data.frame columns (cont.)

We can supply each vector separately

lm(df$y ~ df$x)

…or we can use with() to call lm() “inside” df:

with(df, lm(y ~ x))

…or we can use the data argument of lm()

lm(y ~ x, data = df) # check ?lm for all arguments
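
A short sketch of what the fitted model returns, using the same df. Since y equals 3 + x exactly in this toy data, the estimated intercept and slope should be 3 and 1 up to floating-point precision.

```r
df <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
fit <- lm(y ~ x, data = df)
coef(fit)                         # (Intercept) 3, x 1
fit_with <- with(df, lm(y ~ x))   # same model, fitted "inside" df
same_coefs <- isTRUE(all.equal(coef(fit), coef(fit_with)))
same_coefs                        # TRUE
```

The data argument is usually preferable, since the coefficient names stay clean (x rather than df$x).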

Scoping

Packages

  • Packages are R extensions that can be really useful
  • Installed with install.packages() and loaded with library()
library(memer)
meme_get("OprahGiveaway") |>
  meme_text_bottom("EVERYONE GETS AN A",
                   size = 36)


Scoping

Packages (cont.)

  • library() attaches a package to the search path, so its functions can be called by name
  • Call package functions that are not loaded with ::
plt()
Error in plt(): could not find function "plt"
tinyplot::plt(
  Sepal.Length ~ Petal.Length | Species,
  data = iris
)
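
The :: operator works for any installed package, including those that ship with R. A minimal sketch that runs without installing anything, followed by one possible way to check whether a package is available before using it:

```r
# stats ships with R, so this works in a fresh session
middle <- stats::median(c(1, 5, 3))
middle   # 3

# requireNamespace() returns TRUE if the package is installed
if (requireNamespace("tinyplot", quietly = TRUE)) {
  message("tinyplot is installed")
} else {
  message("install it with install.packages(\"tinyplot\")")
}
```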


Scoping

Methods

Let’s try to look at the source code for the mean() function:

mean
function (x, ...) 
UseMethod("mean")
<bytecode: 0x12767e328>
<environment: namespace:base>

mean() calls different methods depending on object class

methods(mean)
[1] mean.Date        mean.default     mean.difftime    mean.IDate*     
[5] mean.ITime*      mean.POSIXct     mean.POSIXlt     mean.quosure*   
[9] mean.vctrs_vctr*
see '?methods' for accessing help and source code
mean(c(1,2,3,4))
[1] 2.5
mean(as.Date(c("2024-01-01",
               "2025-01-01")))
[1] "2024-07-02"
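
The same mechanism is available for your own functions. A minimal sketch of a generic with three methods; describe() and its methods are hypothetical names made up for this example.

```r
# The generic only dispatches; the work is done by the methods
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "some object"
describe.numeric <- function(x, ...) paste("numbers with mean", mean(x))
describe.data.frame <- function(x, ...) paste("table with", nrow(x), "rows")

describe(c(1, 2, 3, 4))         # "numbers with mean 2.5"
describe(data.frame(a = 1:2))   # "table with 2 rows"
describe("text")                # "some object": falls back to the default method
```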

Next lecture: Visualization